Random Forest in Machine Learning

Forest Foresight

Mika Goins
Matt McGehee
Stutti Smit-Johnson
Advisor: Dr. Seals

Random Forest in Machine Learning: Foresight from the Forest

A Random Forest Guided Tour

by Gérard Biau and Erwan Scornet [1]

  • Origin & Success: Introduced by Breiman (2001) [2], Random Forests excel in classification/regression, combining decision trees for strong performance.
  • Versatility: Effective for large-scale tasks, adaptable, and highlights important features across various domains.
  • Ease of Use: Simple with minimal tuning, handles small samples and high-dimensional data.
  • Key Mechanisms: Uses bagging and CART-split criteria for robust performance, though hard to analyze rigorously.
  • Theoretical Gaps: Limited theoretical insights; known for complexity and black-box nature.

Tree Prediction

Each tree estimates the response at point \(x\) as:

\[ m_n(x; \Theta_j, D_n) = \sum_{i \in D_n(\Theta_j)} \frac{\mathbf{1}_{X_i \in A_n(x; \Theta_j, D_n)} Y_i}{N_n(x; \Theta_j, D_n)} \]

  • \(D_n(\Theta_j)\) is the resampled data subset,
  • \(A_n(x; \Theta_j, D_n)\) is the cell containing \(x\), and
  • \(N_n(x; \Theta_j, D_n)\) is the count of points in the cell
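As a minimal sketch of the cell-average estimate above (assuming a fixed, pre-built 1-D partition rather than a fitted CART tree, with made-up data):

```python
import numpy as np

# The single-tree estimate: the prediction at x is the average of the
# responses Y_i whose X_i fall in the same cell A_n(x) as x.
X = np.array([0.1, 0.2, 0.4, 0.6, 0.9])
Y = np.array([1.0, 1.2, 2.0, 2.1, 3.5])

def tree_predict(x, X, Y, cell_edges):
    """Average Y_i over the points X_i that share x's cell."""
    # Find which interval [edge_k, edge_{k+1}) contains x
    k = np.searchsorted(cell_edges, x, side="right") - 1
    in_cell = (X >= cell_edges[k]) & (X < cell_edges[k + 1])
    # Sum of Y_i in the cell divided by N_n, as in the formula
    return Y[in_cell].mean()

edges = np.array([0.0, 0.5, 1.0])   # two cells: [0, 0.5) and [0.5, 1.0)
print(tree_predict(0.3, X, Y, edges))  # ≈ 1.4, the mean of Y for X < 0.5
```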

Random Forest Classification

Splitting Criteria:

  • The Gini impurity measure is used to determine the best split:

\[ G = 1 - \sum_{k=1}^{K} p_k^2 \]

  • \(p_k\) represents the proportion of samples of class \(k\) in the node.
  • \(K\) is the number of classes.
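A small illustration of the Gini computation (the labels here are illustrative, not the project's data):

```python
import numpy as np

def gini_impurity(labels):
    """G = 1 - sum_k p_k^2, where p_k is the class-k proportion in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A pure node has G = 0; a 50/50 binary node has the maximum G = 0.5.
print(gini_impurity([0, 0, 0, 0]))  # → 0.0
print(gini_impurity([0, 0, 1, 1]))  # → 0.5
```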

Prediction:

  • Each tree makes a prediction using the majority class in the cell containing \(x\).
  • Classification uses a majority vote:

\[ m_{M, n}(x; \Theta_1, \ldots, \Theta_M, D_n) = \begin{cases} 1 & \text{if } \frac{1}{M} \sum_{j=1}^{M} m_n(x; \Theta_j, D_n) > \frac{1}{2} \\ 0 & \text{otherwise} \end{cases} \]

  • \(m_n(x; \Theta_j, D_n)\): Prediction from the \(j\)-th tree.
  • \(M\): Total number of trees in the forest.
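The majority vote can be sketched directly from the formula (with ties broken toward class 0, as the strict inequality above implies):

```python
import numpy as np

def forest_classify(tree_preds):
    """Majority vote over M per-tree 0/1 predictions: output 1 iff the
    mean vote exceeds 1/2; a tie falls through to class 0."""
    tree_preds = np.asarray(tree_preds)
    return int(tree_preds.mean() > 0.5)

print(forest_classify([1, 0, 1, 1, 0]))  # 3/5 > 1/2 → 1
```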

Figure: random forest ensemble illustration [3]

The Data

  1. Source – Florida-based food packaging company

  2. Size – 33,818 records, 17 variables

  3. Key variables – demographic, behavioral, seasonal

  4. Preprocessing needed:

    1. TotalPrice – negative values
    2. Quantity – highly skewed
    3. Substrate – missing values
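One plausible cleaning pass for the three issues above, sketched in pandas. The column names follow the data dictionary, but the specific choices (dropping negative prices, a log1p transform, an explicit "Unknown" category) are assumptions, not necessarily the team's actual steps:

```python
import numpy as np
import pandas as pd

# Toy rows exhibiting the three problems listed above
df = pd.DataFrame({
    "TotalPrice": [120.0, -35.0, 80.0],       # negatives, e.g. credits/returns
    "qtyOrdered": [1, 5, 5000],               # heavily right-skewed
    "Substrate":  ["Paper", "Plastic", None], # missing values
})

# TotalPrice: drop (or handle separately) rows with negative totals
df = df[df["TotalPrice"] >= 0].copy()

# Quantity: log1p transform to reduce right skew
df["qtyOrderedLog"] = np.log1p(df["qtyOrdered"])

# Substrate: fill missing values with an explicit "Unknown" category
df["Substrate"] = df["Substrate"].fillna("Unknown")
print(df)
```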

Data Schema

Data Dictionary
Attribute | Format | Description
OPCO | Varchar | The customer placing the order; in this case, typically a distributor.
SalesOrderID | Varchar | Unique identifier assigned to each sales order.
CustomerPO | Varchar | Customer’s identifier for the order sent to BCC.
Product | Varchar | Unique identifier assigned to each product.
Description | Varchar | Description of the product being sold.
Substrate | Varchar | Type of product/material.
RequestedDeliveryDate | Varchar | Date the delivery was originally scheduled.
DateFulfilled | Varchar | Date the delivery was made.
qtyOrdered | Numeric | Quantity ordered on the sales order.
qtyFulfilled | Numeric | Quantity delivered on the sales order.
UnitPrice | Numeric | Price per case of product that SSI charges the customer.
TotalPrice | Numeric | Total price of the sales order.
Class | Varchar | Customer name.
ShipToName | Varchar | Name on the ordering party’s address.
ShipToAddress | Varchar | Address where the order is to be delivered.
SalesOrderStatus | Varchar | Status of the sales order.
SalesItemStatus | Varchar | Status of each line item on the sales order.

Sales over Time

Analysis - Stutti

Predicting Customer Churn

  • A binary churn indicator (0/1) was created from each customer’s Last Sales Date.
  • Predictors: Class, Product, Qty Ordered, and Date Fulfilled.
  • The model was evaluated using statistics from the confusion matrix.
  • 80% accuracy achieved:
    • Sensitivity: the model correctly identifies 78.6% of the actual 0 cases.
    • Specificity: the model correctly identifies 88.12% of the actual 1 cases.
    • Negative Predictive Value (NPV, class 1): when the model predicts 1, it is correct only 47.62% of the time, so the model is likely missing some class 1 (churn) cases.
    • McNemar’s Test p-value (<2e-16): indicates a statistically significant asymmetry in misclassification between the two classes.
  • Conclusion: overall, the model balances the two classes well (balanced accuracy 0.8336), though it is better at predicting class 0.
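For illustration, these statistics can be reproduced from a 2×2 confusion matrix. The counts below are invented so that the rates roughly match the slide’s figures; they are not the project’s actual matrix:

```python
import numpy as np

# Rows = actual class, columns = predicted class (illustrative counts)
cm = np.array([[786, 214],    # actual 0: predicted 0, predicted 1
               [ 24, 178]])   # actual 1: predicted 0, predicted 1
n00, n01 = cm[0]
n10, n11 = cm[1]

sens_class0 = n00 / (n00 + n01)            # → 0.786   (78.6% of actual 0)
spec_class1 = n11 / (n10 + n11)            # ≈ 0.8812  (88.12% of actual 1)
accuracy = (n00 + n11) / cm.sum()          # ≈ 0.80
balanced = (sens_class0 + spec_class1) / 2 # ≈ 0.8336
print(sens_class0, spec_class1, accuracy, balanced)
```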

ROC Curve

Analysis - Matt

Goal:

Predict whether an OPCO (distributor) falls within the top 25% of revenue, using quantity ordered, product, and substrate, to identify key distribution channels where the company could focus its marketing and advertising dollars.
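One way the 0/1 target could be constructed is by flagging an OPCO whose total revenue falls in the top quartile. This is a sketch with hypothetical data; the aggregation and threshold rule are assumptions:

```python
import pandas as pd

# Hypothetical order lines; column names follow the data dictionary
orders = pd.DataFrame({
    "OPCO":       ["A", "A", "B", "C", "D"],
    "TotalPrice": [500.0, 700.0, 120.0, 90.0, 60.0],
})

# Total revenue per OPCO, then flag the top 25% as high-revenue (1)
revenue = orders.groupby("OPCO")["TotalPrice"].sum()
threshold = revenue.quantile(0.75)
high_revenue = (revenue >= threshold).astype(int)
print(high_revenue)  # only OPCO "A" is flagged in this toy example
```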

Confusion Matrix

  • 0: Non-high-revenue OPCO
  • 1: High-revenue OPCO

Model Statistics

Metric | Value
Accuracy | 0.956
95% CI | (0.951, 0.96)
Kappa | 0.73
Sensitivity | 0.98911
Specificity | 0.66256
Pos Pred Value | 0.96291
Neg Pred Value | 0.87297
Prevalence | 0.89854
Detection Rate | 0.88876
Balanced Accuracy | 0.82583
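As a quick consistency check (not part of the original analysis), the derived rows of the table follow directly from its base rates:

```python
# Base rates reported in the table above
sensitivity = 0.98911
specificity = 0.66256
prevalence  = 0.89854

# Balanced accuracy is the mean of sensitivity and specificity
balanced_accuracy = (sensitivity + specificity) / 2  # ≈ 0.82583 (table value)

# Detection rate = true positives / total = sensitivity * prevalence
detection_rate = sensitivity * prevalence            # ≈ 0.88876 (table value)
print(balanced_accuracy, detection_rate)
```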

ROC Curve Analysis

Figure 1: ROC Curve for High Revenue Prediction

Feature Importance

Figure 2: Feature Importance for High Revenue Prediction

Analysis - Mika

Conclusion

  • High Performance & Versatile: Robust, accurate, handles noisy/high-dimensional data. [1]
  • Ensemble Strength: Averaging over many trees; identifies key variables. [1]
  • Challenges: Limited theoretical understanding; complex interpretation. [1]
  • Future Focus: Enhance theory, increase interpretability, broaden applications. [1]

References

[1] G. Biau and E. Scornet, “A random forest guided tour,” TEST, vol. 25, no. 2, pp. 197–227, Jun. 2016.
[2] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[3] Y. Fu, “Combination of random forests and neural networks in social lending,” Journal of Financial Risk Management, vol. 6, no. 4, pp. 418–426, 2017, doi: 10.4236/jfrm.2017.64030.